The used-car service «Не бит, не крашен» ("never crashed, never repainted") is developing an app to attract new customers. In it, you can quickly find out the market value of your car. At our disposal is historical data: technical specifications, trim levels and prices. We need to build a model that estimates the price.

The customer cares about:

- prediction quality
- training time
- prediction speed

Features:

- DateCrawled — date the listing was scraped from the database
- VehicleType — vehicle body type
- RegistrationYear — year of vehicle registration
- Gearbox — gearbox type
- Power — power (hp)
- Model — car model
- Kilometer — mileage (km)
- RegistrationMonth — month of vehicle registration
- FuelType — fuel type
- Brand — car brand
- Repaired — whether the car was repaired or not
- DateCreated — date the listing was created
- NumberOfPictures — number of photos of the car
- PostalCode — postal code of the listing's owner (user)
- LastSeen — date of the user's last activity

Target:

- Price — price (EUR)
# ydata_profiling does not work without this
!pip install --upgrade pillow
Requirement already satisfied: pillow in /opt/conda/lib/python3.9/site-packages (9.5.0)
!pip install ydata_profiling -U
Requirement already satisfied: ydata_profiling in /opt/conda/lib/python3.9/site-packages (4.2.0)
!pip install optuna
Requirement already satisfied: optuna in /opt/conda/lib/python3.9/site-packages (3.2.0)
!pip install lightgbm
Requirement already satisfied: lightgbm in /opt/conda/lib/python3.9/site-packages (3.3.1)
import pandas as pd
# data profiling
from ydata_profiling import ProfileReport
import numpy as np
import time
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
# warning suppression
import warnings
from IPython.display import display
# loading data either from the server or locally
import os
# regular expressions
import re
# Student's t-distribution
from scipy import stats as st
# model hyperparameter optimization
import optuna
# from optuna.visualization import plot_parallel_coordinate
# splitting the data into samples
from sklearn.model_selection import train_test_split
# MSE metric
from sklearn.metrics import mean_squared_error
# separate preprocessing of numeric and categorical features
from sklearn.compose import ColumnTransformer
# linear regression
from sklearn.linear_model import LinearRegression
# Ridge regression
from sklearn.linear_model import Ridge
# SGD regression
from sklearn.linear_model import SGDRegressor
# decision tree regressor
from sklearn.tree import DecisionTreeRegressor
# random forest regressor
from sklearn.ensemble import RandomForestRegressor
# building pipelines
from sklearn.pipeline import Pipeline, make_pipeline
# feature scaling
from sklearn.preprocessing import StandardScaler
# missing-value imputation
from sklearn.impute import SimpleImputer
# categorical feature encoding
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
# gradient boosting / LightGBM
import lightgbm as lgb
# hide warnings
warnings.filterwarnings("ignore")
# enlarge the output window
pd.options.display.max_rows = 300
# no limit on the displayed cell width
pd.set_option('display.max_colwidth', None)
# four digits after the decimal point
pd.set_option('display.float_format', '{:.4f}'.format)
# default figure size
plt.rcParams["figure.figsize"] = (7, 5)
# fix the pseudo-random seed for reproducibility
np.random.seed(seed=321)
# path to the server copy
pth1 = '/datasets/autos.csv'
# my local copy
pth2 = 'autos.csv'
# load the data
if os.path.exists(pth1):
    df = pd.read_csv(pth1)
elif os.path.exists(pth2):
    df = pd.read_csv(pth2)
else:
    display('Check the paths to the data files!')
df.head()
|   | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Kilometer | RegistrationMonth | FuelType | Brand | Repaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-24 11:52:17 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
| 1 | 2016-03-24 10:58:45 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
| 2 | 2016-03-14 12:52:21 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
| 3 | 2016-03-17 16:54:04 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 2016-03-17 00:00:00 | 0 | 91074 | 2016-03-17 17:40:17 |
| 4 | 2016-03-31 17:25:20 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 2016-03-31 00:00:00 | 0 | 60437 | 2016-04-06 10:17:21 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64
 8   RegistrationMonth  354369 non-null  int64
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64
 14  PostalCode         354369 non-null  int64
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB
Conclusion: the data loaded correctly.

Let's convert the column names to Pythonic snake_case.
df.columns = ['_'.join(re.sub(r'([A-Z])', r' \1', col).split()).lower() for col in df.columns]
# df.columns = [re.sub( '(?<!^)(?=[A-Z])', '_', col).lower() for col in df.columns]
df.columns
Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
'power', 'model', 'kilometer', 'registration_month', 'fuel_type',
'brand', 'repaired', 'date_created', 'number_of_pictures',
'postal_code', 'last_seen'],
dtype='object')
profile_df = ProfileReport(df)
profile_df.to_notebook_iframe()
# profile_df.to_html()
Conclusion: based on the data profiling, we need to take the following steps to prepare for machine learning:

- registration_year, registration_month, power, price contain either missing values, zeros, or anomalously low values

When training the models we will do additional preprocessing tailored to how each model works.
Most likely, these duplicates appeared when databases were merged.
df = df.drop_duplicates(keep='first')
df.shape
(354365, 16)
Conclusion: 4 duplicate rows were removed.

So what columns do we have?
df.columns
Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
'power', 'model', 'kilometer', 'registration_month', 'fuel_type',
'brand', 'repaired', 'date_created', 'number_of_pictures',
'postal_code', 'last_seen'],
dtype='object')
Columns to drop:

- date_crawled — date the listing was scraped from the database
- date_created — date the listing was created
- number_of_pictures — number of photos (it is 0 everywhere)
- last_seen — the user's last activity
- registration_month — month of vehicle registration
- vehicle_type — vehicle body type
- fuel_type — fuel type
- brand — brand (the model name is probably enough)
- postal_code — postal code

None of these features has a significant correlation with the target price (<0.2).
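The correlation check behind the <0.2 claim can be sketched as below. The numbers here are made up toy data; only the column names and the threshold follow the project:

```python
import pandas as pd

# Toy stand-in for the real dataset (made-up values, same column names)
toy = pd.DataFrame({
    'price':       [480, 18300, 9800, 1500, 3600],
    'postal_code': [70435, 66954, 90480, 91074, 60437],
    'power':       [0, 190, 163, 75, 69],
})

# Absolute Pearson correlation of each numeric feature with the target
corr_with_price = toy.corr()['price'].drop('price').abs()

# Features below the 0.2 threshold are candidates for dropping
weak = corr_with_price[corr_with_price < 0.2].index.tolist()
```

On the real data one would run the same computation on the numeric columns of `df` (categorical features would need encoding, or a measure such as phik from the profiling report).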
columns_for_deleting = ['date_crawled', 'date_created', 'number_of_pictures', 'last_seen'
, 'registration_month'
# , 'vehicle_type', 'fuel_type', 'brand', 'fuel_type', 'model'
]
df.drop(columns=columns_for_deleting, inplace=True)
df.shape
(354365, 11)
Conclusion: 5 columns with no bearing on the price were dropped.
df[df['price'] == 100].sample(5)
|   | price | vehicle_type | registration_year | gearbox | power | model | kilometer | fuel_type | brand | repaired | postal_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 115596 | 100 | small | 1998 | manual | 50 | polo | 150000 | petrol | volkswagen | NaN | 89555 |
| 102364 | 100 | NaN | 1995 | manual | 80 | 3_reihe | 150000 | NaN | peugeot | NaN | 24143 |
| 63385 | 100 | small | 1995 | manual | 54 | polo | 150000 | petrol | volkswagen | yes | 8525 |
| 289229 | 100 | sedan | 1991 | manual | 75 | golf | 150000 | NaN | volkswagen | yes | 57567 |
| 172302 | 100 | small | 2002 | manual | 0 | fiesta | 150000 | petrol | ford | yes | 7924 |
Most likely, these token prices come down to the human factor: a wish to avoid taxes, or a mistake made while filling in the listing. The traits that stand out are mileage around 150,000 km and past repairs. Perhaps such cars really are worth nothing.

The 5% quantile shows that 5% of the cars are priced below 200 euros, while 3% are priced at 0 euros.

With the data at hand it is hard to tell whether this is an anomaly or simply the market.

We'll assume it is the market, but keep an eye on this peculiarity.
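The quantile figures above come from computations like the following sketch, where a toy series stands in for the real `df['price']`:

```python
import pandas as pd

# Toy price series standing in for df['price']
prices = pd.Series([0, 0, 0, 100, 150, 200, 500, 1099, 2799, 6500])

q05 = prices.quantile(0.05)        # the 5% quantile of prices
share_zero = (prices == 0).mean()  # fraction of zero-price listings
```

`Series.quantile` interpolates linearly between order statistics by default, so on real data the 5% quantile need not be a value present in the series.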
# df['registration_year'].describe()
Let's look at the density of the number of listings by registration year, as well as at a box plot.
fig, ax = plt.subplots(2, figsize=(18, 18))
fig.subplots_adjust(hspace=0.3)
# KDE plot for the registration year
x_0 = df['registration_year']
ax[0].set_xlabel('Registration year')
ax[0].set_ylabel('Density')
ax[0].set_title('Density of the number of listings by registration year')
ax[0].grid(True)
sns.kdeplot(x_0, ax=ax[0], shade=True, color='blue')
ax[0].set_xlim(1000, 2023 + 50)  # from the first car ever built to the present, with a 50-year margin
# box plot for the registration year
x_1 = df['registration_year']
ax[1].set_xlabel('Registration year')
ax[1].set_title('Box plot of the registration year')
ax[1].grid(True)
sns.boxplot(x_1, ax=ax[1], color='blue')
ax[1].set_xlim(1000, 2023 + 50)  # from the first car ever built to the present, with a 50-year margin
plt.show()
As we can see, there are records outside any plausible range of registration years. Since we cannot recover these values, we'll drop the rows.
df = df.loc[(df['registration_year'] <= 2016) & (df['registration_year'] >= 1886)]
df.shape
(339769, 11)
According to www.drive2.ru, the most powerful car is the Transtar Dagger GT (2010, USA; 2,500 hp; 483 km/h). Let's see what we can buy from our database, say, anything from 550 hp upwards.
display(df.loc[df['power'] >= 550]['power'].count())
df.loc[df['power'] >= 550].sort_values(by='power').head(10)
378
|   | price | vehicle_type | registration_year | gearbox | power | model | kilometer | fuel_type | brand | repaired | postal_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 124348 | 19999 | sedan | 2005 | NaN | 550 | other | 150000 | NaN | mercedes_benz | no | 88499 |
| 120559 | 9999 | coupe | 1999 | manual | 550 | golf | 150000 | petrol | volkswagen | no | 72202 |
| 33590 | 300 | coupe | 2002 | manual | 551 | NaN | 5000 | petrol | sonstige_autos | NaN | 65191 |
| 311474 | 10500 | coupe | 1995 | manual | 551 | other | 90000 | petrol | nissan | yes | 32479 |
| 16385 | 100 | small | 1996 | manual | 553 | NaN | 150000 | NaN | renault | yes | 85586 |
| 75230 | 17999 | sedan | 2012 | auto | 560 | m_reihe | 40000 | petrol | bmw | no | 6249 |
| 243754 | 13500 | sedan | 1994 | manual | 572 | vectra | 20000 | petrol | opel | no | 63165 |
| 209751 | 1 | NaN | 1995 | NaN | 574 | 3er | 150000 | NaN | bmw | NaN | 6846 |
| 307452 | 18200 | wagon | 2008 | auto | 579 | other | 150000 | petrol | audi | yes | 21129 |
| 105658 | 19000 | wagon | 2008 | auto | 579 | other | 150000 | petrol | audi | yes | 21129 |
My pick: the Renault for 100 euros with 553 hp!

It looks like many of these cars were "boosted" by a factor of 10 or even 100. Unfortunately, we lack the expert knowledge to settle this question properly, so we'll drop these 378 listings.
df = df.loc[df['power'] < 550]
df.shape
(339391, 11)
Now that the data is prepared, let's restate the task.

Target:

- price — numeric

Features:

- vehicle_type — categorical
- registration_year — numeric
- gearbox — categorical
- power — numeric
- model — categorical
- kilometer — numeric
- fuel_type — categorical
- brand — categorical
- repaired — categorical
- postal_code — categorical

We need to find a model such that:

One of the models must be LightGBM, and at least one must not be a boosting model.

So we are solving a regression task. We have 10 features, 7 of which are categorical and 3 numeric.
Our loss function will be *RMSE*.

To reduce training and prediction time we can go one of two ways:

Based on this, we will consider and tune the following models:

For these models we will select optimal hyperparameters and analyze the results on the validation set. On each of the three criteria we will assign every model a rank; the winner is the model with the smallest rank sum.
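The rank-sum selection described above can be sketched as follows. The numbers are hypothetical; the column names follow the project's `results_table`:

```python
import pandas as pd

# Hypothetical results table; lower is better on every criterion
results = pd.DataFrame({
    'model_or_pipeline':  ['LinearRegression', 'DecisionTree', 'LGBM'],
    'fit_wall_time':      [0.5, 1.3, 971.0],
    'predict_wall_time':  [0.22, 0.13, 10.9],
    'rmse_valid_result':  [3133.8, 2078.8, 1691.1],
})

criteria = ['fit_wall_time', 'predict_wall_time', 'rmse_valid_result']
# rank() assigns 1 to the smallest value in each column
results['rank_sum'] = results[criteria].rank().sum(axis=1)
winner = results.loc[results['rank_sum'].idxmin(), 'model_or_pipeline']
```

Note that this weights all three criteria equally; a real selection might weight quality more heavily than speed.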
The best model will then be checked on the test set.

Throughout this project, computations and splits use a pseudo-random number generator. We set random_state to 321 for the entire project.
rs = 321
RMSE is the square root of MSE, the mean squared error. So by optimizing MSE we automatically optimize RMSE as well. Hence, our loss function can be either MSE or RMSE, while the metric is RMSE, which is easy to obtain by taking the square root (or by passing squared=False to mean_squared_error).
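As a quick check of the MSE/RMSE relationship above (toy numbers, plain numpy; in this project's scikit-learn version, `mean_squared_error(..., squared=False)` returns the same value directly):

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mse = np.mean((y_true - y_pred) ** 2)  # MSE: (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                    # RMSE is just the square root of MSE
```

Because the square root is monotonic, any set of predictions that minimizes MSE also minimizes RMSE, which is why the two are interchangeable as a loss.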
We will collect the results for the main criteria in a dedicated DataFrame, and write helper functions to record them.

Note: the wall times we record cover not only the model itself but also the data preprocessing, both for fitting and for prediction.

The results table.
# our results table
results_table = pd.DataFrame(columns=['model_or_pipeline', 'fit_wall_time', 'predict_wall_time', 'rmse_valid_result'])
results_table
| model_or_pipeline | fit_wall_time | predict_wall_time | rmse_valid_result |
|---|---|---|---|
A function to record the results for most of the models.
# row index into the results table
i = 0

# measurement helper
def measure_for_models(X_train, y_train, X_valid, y_valid, model, model_name, i):
    results_table.loc[i, 'model_or_pipeline'] = model_name
    start = time.time()
    model.fit(X_train, y_train)
    end = time.time()
    fit_time = end - start
    results_table.loc[i, 'fit_wall_time'] = fit_time
    start = time.time()
    y_pred = model.predict(X_valid)
    end = time.time()
    pred_time = end - start
    results_table.loc[i, 'predict_wall_time'] = pred_time
    rmse = mean_squared_error(y_valid, y_pred, squared=False)
    results_table.loc[i, 'rmse_valid_result'] = rmse
    i += 1
    return i
A function to record the results for the LightGBM model.
def measure_for_lgbm(X_train, y_train, X_valid, y_valid, model, model_name, i, cat_features):
    results_table.loc[i, 'model_or_pipeline'] = model_name
    start = time.time()
    model.fit(X_train, y_train, regressor__categorical_feature=cat_features)
    end = time.time()
    fit_time = end - start
    results_table.loc[i, 'fit_wall_time'] = fit_time
    start = time.time()
    y_pred = model.predict(X_valid)
    end = time.time()
    pred_time = end - start
    results_table.loc[i, 'predict_wall_time'] = pred_time
    rmse = mean_squared_error(y_valid, y_pred, squared=False)
    results_table.loc[i, 'rmse_valid_result'] = rmse
    i += 1
    return i
Per the task statement, the target feature is price.
target = df['price']
target.describe()
count   339391.00
mean      4472.43
std       4546.13
min          0.00
25%       1099.00
50%       2799.00
75%       6500.00
max      20000.00
Name: price, dtype: float64
The training features will be those that remain after our data preparation and are not the target.
features = df.drop(columns='price')
features.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 339391 entries, 0 to 354368
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   vehicle_type       316509 non-null  object
 1   registration_year  339391 non-null  int64
 2   gearbox            321501 non-null  object
 3   power              339391 non-null  int64
 4   model              321813 non-null  object
 5   kilometer          339391 non-null  int64
 6   fuel_type          312213 non-null  object
 7   brand              339391 non-null  object
 8   repaired           274730 non-null  object
 9   postal_code        339391 non-null  int64
dtypes: int64(4), object(6)
memory usage: 28.5+ MB
Conclusion: the target and the training features were extracted correctly.

We're in luck: the customer imposed no requirements on the hold-out samples and provided plenty of data, so we'll split it according to the classic scheme:
# 20% of the full set for the test sample
features_train, features_test, target_train, target_test = train_test_split(features, target
                                                                            , test_size=0.2
                                                                            , random_state=rs
                                                                            , shuffle=True
                                                                            # , stratify=target
                                                                            )
# 25% of the remaining 80% for the validation sample
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train
                                                                              , test_size=0.25
                                                                              , random_state=rs
                                                                              , shuffle=True
                                                                              # , stratify=target_train
                                                                              )
display(features.shape, features_train.shape, features_valid.shape, features_test.shape)
(339391, 10)
(203634, 10)
(67878, 10)
(67879, 10)
display(target.shape, target_train.shape, target_valid.shape, target_test.shape)
(339391,)
(203634,)
(67878,)
(67879,)
# target_train.describe()
# target_valid.describe()
# target_test.describe()
Conclusion: the split into training, validation and test samples succeeded.

We will use the Optuna library to search for optimal model hyperparameters.

Each search will be limited to 10 minutes (600 seconds) and at most 100 trials.

As mentioned earlier, we will preprocess the data for each model separately: impute missing values, scale features, and so on.
Record the categorical features.
category_features = ['gearbox', 'repaired'
, 'model'
, 'vehicle_type', 'fuel_type', 'brand'
]
category_features
['gearbox', 'repaired', 'model', 'vehicle_type', 'fuel_type', 'brand']
Record the column indices of the categorical features (needed for LightGBM).
category_features_index = []
for col in category_features:
    category_features_index.append(df.columns.get_loc(col))
category_features_index
[3, 9, 5, 1, 7, 8]
Record the numeric features.
num_features = ['registration_year', 'power', 'kilometer'
# , 'registration_month'
]
num_features
['registration_year', 'power', 'kilometer']
Let's prepare preprocessing pipelines that take the base algorithm into account and handle numeric and categorical features separately.
# define the Pipeline steps
# numeric features for the linear regression base algorithm:
# fill missing values with 0, then standardize/scale
num_steps_linear = [('imputer', SimpleImputer(missing_values=np.nan
                                              , strategy='constant'
                                              , fill_value=0
                                              , add_indicator=True))
                    , ('scaler', StandardScaler())]
num_preprocessor_linear = Pipeline(num_steps_linear)
# numeric features for the decision tree base algorithm:
# fill missing values with an out-of-range constant
num_steps_tree = [('imputer', SimpleImputer(missing_values=np.nan
                                            , strategy='constant'
                                            , fill_value=-1000))]
num_preprocessor_tree = Pipeline(num_steps_tree)
# categorical features for the linear regression base algorithm
# (OrdinalEncoder has no handle_unknown='ignore'; unseen categories are encoded as -1)
category_steps_linear = [('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='unknown'))
                         , ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
                         , ('scaler', StandardScaler())]
# category_steps_linear = [('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='unknown'))
#                          , ('ohe', OneHotEncoder(drop='first'))]
category_preprocessor_linear = Pipeline(category_steps_linear)
# categorical features for the decision tree base algorithm
# (a signed dtype so that unknown_value=-1 fits)
category_steps_tree = [('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='unknown'))
                       , ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, dtype=np.int64))
                       # , ('scaler', StandardScaler())
                       ]
category_preprocessor_tree = Pipeline(category_steps_tree)
# assemble everything for the linear regression base algorithm
preprocessor_linear = ColumnTransformer(transformers=[('num', num_preprocessor_linear, num_features)
                                                      , ('cat', category_preprocessor_linear, category_features)]
                                        , remainder='passthrough')
# assemble everything for the decision tree base algorithm
preprocessor_tree = ColumnTransformer(transformers=[('num', num_preprocessor_tree, num_features)
                                                    , ('cat', category_preprocessor_tree, category_features)]
                                      , remainder='passthrough')
Conclusion: we have prepared a preprocessing pipeline for each model family.
category_preprocessor_linear.fit_transform(features_train).shape
(203634, 10)
Let's tune the hyperparameters for linear regression.
def objective(trial):
    fit_intercept = trial.suggest_categorical('fit_intercept', [False, True])
    positive = trial.suggest_categorical('positive', [False, True])
    pipeline_linear_regresion = Pipeline([('preprocessor', preprocessor_linear)
                                          , ('regressor', LinearRegression(fit_intercept=fit_intercept
                                                                           , copy_X=True, n_jobs=-1
                                                                           , positive=positive))])
    pipeline_linear_regresion.fit(features_train, target_train)
    prediction_train = pipeline_linear_regresion.predict(features_train)
    rmse = mean_squared_error(target_train, prediction_train, squared=False)
    return rmse
study_linear_regression = optuna.create_study(direction='minimize')
study_linear_regression.optimize(objective, n_trials=4)
study_linear_regression.best_trial
[I 2023-06-12 12:39:59,526] A new study created in memory with name: no-name-b6e8429c-883f-4083-bb97-726215a299a6
[I 2023-06-12 12:40:00,406] Trial 0 finished with value: 3666.412009773475 and parameters: {'fit_intercept': False, 'positive': False}. Best is trial 0 with value: 3666.412009773475.
[I 2023-06-12 12:40:01,437] Trial 1 finished with value: 3666.412009773475 and parameters: {'fit_intercept': False, 'positive': False}. Best is trial 0 with value: 3666.412009773475.
[I 2023-06-12 12:40:02,377] Trial 2 finished with value: 4078.7990202471888 and parameters: {'fit_intercept': False, 'positive': True}. Best is trial 0 with value: 3666.412009773475.
[I 2023-06-12 12:40:03,403] Trial 3 finished with value: 3126.2790320265517 and parameters: {'fit_intercept': True, 'positive': False}. Best is trial 3 with value: 3126.2790320265517.
FrozenTrial(number=3, state=TrialState.COMPLETE, values=[3126.2790320265517], datetime_start=datetime.datetime(2023, 6, 12, 12, 40, 2, 378619), datetime_complete=datetime.datetime(2023, 6, 12, 12, 40, 3, 402866), params={'fit_intercept': True, 'positive': False}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'fit_intercept': CategoricalDistribution(choices=(False, True)), 'positive': CategoricalDistribution(choices=(False, True))}, trial_id=3, value=None)
Let's record the metrics in our table.
study_linear_regression.best_params
pipeline_linear_regression = Pipeline([('preprocessor', preprocessor_linear)
                                       , ('regressor', LinearRegression(n_jobs=-1
                                                                        , fit_intercept=study_linear_regression.best_params['fit_intercept']
                                                                        , positive=study_linear_regression.best_params['positive']))])
i = measure_for_models(features_train, target_train, features_valid, target_valid
                       , pipeline_linear_regression
                       , 'LinearRegression. Params: ' + str(study_linear_regression.best_params)
                       , i)
results_table
| | model_or_pipeline | fit_wall_time | predict_wall_time | rmse_valid_result |
|---|---|---|---|---|
| 0 | LinearRegression. Parameters: {'fit_intercept': True, 'positive': False} | 0.50 | 0.22 | 3133.76 |
| 1 | Ridge. Parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000} | 0.49 | 0.29 | 4529.75 |
| 2 | SGDRegressor. Parameters: {'max_iter': 6000, 'alpha': 0.15000000000000002} | 21.00 | 0.11 | 27112617069432432.00 |
| 3 | DecisionTreeRegressor. Parameters: {'max_depth': 15} | 1.28 | 0.13 | 2078.77 |
| 4 | RandomForestRegressor. Parameters: {'max_depth': 15, 'n_estimators': 100} | 60.40 | 1.25 | 1769.28 |
| 5 | LGBMRegressor. Parameters: {'n_estimators': 310, 'max_depth': 9, 'min_split_gain': 13.613306298597603, 'num_leaves': 2660} | 971.28 | 10.91 | 1691.10 |
| 6 | LinearRegression. Parameters: {'fit_intercept': True, 'positive': False} | 0.47 | 0.23 | 3133.76 |
| 7 | LinearRegression. Parameters: {'fit_intercept': True, 'positive': False} | 0.47 | 0.23 | 3133.76 |
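Since `results_table` accumulates a row per run, sorting it by validation RMSE makes the comparison easier to read. A minimal sketch, assuming `results_table` is a pandas DataFrame with the columns shown above (the values here are copied from the table for illustration):

```python
import pandas as pd

# stand-in frame mirroring a few rows of results_table
results_table = pd.DataFrame({
    'model_or_pipeline': ['LinearRegression', 'Ridge', 'DecisionTreeRegressor',
                          'RandomForestRegressor', 'LGBMRegressor'],
    'rmse_valid_result': [3133.76, 4529.75, 2078.77, 1769.28, 1691.10],
})

# rank models from best (lowest RMSE) to worst
ranked = results_table.sort_values('rmse_valid_result').reset_index(drop=True)
print(ranked.loc[0, 'model_or_pipeline'])  # → LGBMRegressor
```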
Let's tune the hyperparameters for Ridge. This is linear regression with L2 regularization; with solver='lsqr' it solves the regularized least-squares problem iteratively.
def objective(trial):
    fit_intercept = trial.suggest_categorical('fit_intercept', [False, True])
    positive = trial.suggest_categorical('positive', [False, True])
    alpha = trial.suggest_float('alpha', 0.01, 3, step=0.01)
    max_iter = trial.suggest_int('max_iter', 1000, 15000, step=1000)
    pipeline_ridge_regression = Pipeline([
        ('preprocessor', preprocessor_linear),
        ('regressor', Ridge(
            alpha=alpha,
            fit_intercept=fit_intercept,
            max_iter=max_iter,
            # 'positive' is not passed: the 'lsqr' solver does not support it,
            # so trials differ only via fit_intercept, alpha and max_iter
            solver='lsqr',
        )),
    ])
    pipeline_ridge_regression.fit(features_train, target_train)
    # NB: the trial is scored on the training set; validation RMSE
    # is measured later by measure_for_models
    rmse = mean_squared_error(
        target_train, pipeline_ridge_regression.predict(features_train),
        squared=False,
    )
    return rmse
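Note that the objective returns training RMSE, so the search rewards fit to the training data. Scoring trials on the hold-out validation split avoids tuning toward overfitting; a minimal sketch, with synthetic data standing in for the project's features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=42)

model = Ridge(alpha=1.0, solver='lsqr').fit(X_tr, y_tr)
# score on the validation split, not on the data the model was fit to
rmse_valid = float(np.sqrt(np.mean((y_va - model.predict(X_va)) ** 2)))
```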
study_ridge_regression = optuna.create_study(
    direction='minimize',
    sampler=optuna.samplers.RandomSampler(seed=rs),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
)
study_ridge_regression.optimize(objective, n_trials=100, timeout=600)
study_ridge_regression.best_trial
[I 2023-06-12 12:40:04,340] A new study created in memory with name: no-name-f4d31230-acd5-4a1d-8294-2676127f2333
[I 2023-06-12 12:40:05,343] Trial 0 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.26, 'max_iter': 8000}. Best is trial 0 with value: 4855.019597835598.
[I 2023-06-12 12:40:06,434] Trial 1 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 1.04, 'max_iter': 4000}. Best is trial 0 with value: 4855.019597835598.
[I 2023-06-12 12:40:07,530] Trial 2 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 0.3, 'max_iter': 5000}. Best is trial 0 with value: 4855.019597835598.
[I 2023-06-12 12:40:08,726] Trial 3 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 0.35000000000000003, 'max_iter': 13000}. Best is trial 0 with value: 4855.019597835598.
[I 2023-06-12 12:40:09,816] Trial 4 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.13, 'max_iter': 6000}. Best is trial 0 with value: 4855.019597835598.
[I 2023-06-12 12:40:10,914] Trial 5 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:12,075] Trial 6 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.0, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:13,239] Trial 7 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.35, 'max_iter': 6000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:14,423] Trial 8 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.55, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:15,531] Trial 9 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.7000000000000001, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:16,634] Trial 10 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.03, 'max_iter': 6000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:17,716] Trial 11 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.1399999999999997, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:18,854] Trial 12 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 1.96, 'max_iter': 13000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:19,924] Trial 13 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.48000000000000004, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:21,126] Trial 14 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.25, 'max_iter': 1000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:22,207] Trial 15 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 1.34, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:23,215] Trial 16 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.8200000000000001, 'max_iter': 11000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:24,331] Trial 17 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.73, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:25,538] Trial 18 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.77, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:26,616] Trial 19 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.5999999999999996, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:27,847] Trial 20 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.6999999999999997, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:28,935] Trial 21 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.5, 'max_iter': 9000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:30,030] Trial 22 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 1.33, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:31,154] Trial 23 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.98, 'max_iter': 4000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:32,252] Trial 24 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.42, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:33,449] Trial 25 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.9600000000000001, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:34,622] Trial 26 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.6900000000000001, 'max_iter': 15000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:35,735] Trial 27 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.98, 'max_iter': 6000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:36,835] Trial 28 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.23, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:37,949] Trial 29 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.4, 'max_iter': 1000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:39,055] Trial 30 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.2, 'max_iter': 12000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:40,255] Trial 31 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.62, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:41,345] Trial 32 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 1.06, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:42,453] Trial 33 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 1.2, 'max_iter': 15000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:43,564] Trial 34 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 1.1500000000000001, 'max_iter': 11000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:44,611] Trial 35 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.6599999999999997, 'max_iter': 9000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:45,633] Trial 36 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.59, 'max_iter': 13000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:46,720] Trial 37 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.85, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:47,848] Trial 38 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.31, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:48,946] Trial 39 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.22, 'max_iter': 15000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:50,039] Trial 40 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.52, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:51,182] Trial 41 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.31, 'max_iter': 15000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:52,233] Trial 42 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 0.19, 'max_iter': 6000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:53,318] Trial 43 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.6199999999999997, 'max_iter': 11000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:54,445] Trial 44 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.55, 'max_iter': 13000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:55,668] Trial 45 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.88, 'max_iter': 15000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:56,826] Trial 46 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 1.1500000000000001, 'max_iter': 13000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:58,128] Trial 47 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.19, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:40:59,134] Trial 48 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.0999999999999996, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:00,212] Trial 49 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.5999999999999996, 'max_iter': 15000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:01,366] Trial 50 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.2399999999999998, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:02,509] Trial 51 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.48000000000000004, 'max_iter': 7000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:03,618] Trial 52 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.5, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:04,715] Trial 53 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.26, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:05,850] Trial 54 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.68, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:06,926] Trial 55 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.45, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:08,177] Trial 56 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.94, 'max_iter': 14000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:09,239] Trial 57 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.19, 'max_iter': 7000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:10,316] Trial 58 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.4299999999999997, 'max_iter': 12000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:11,481] Trial 59 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.23, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:12,641] Trial 60 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.6999999999999997, 'max_iter': 4000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:13,874] Trial 61 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.4, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:14,917] Trial 62 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 1.6400000000000001, 'max_iter': 12000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:16,037] Trial 63 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.69, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:17,117] Trial 64 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.3499999999999996, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:18,216] Trial 65 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.0, 'max_iter': 14000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:19,339] Trial 66 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.54, 'max_iter': 6000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:20,405] Trial 67 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.2199999999999998, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:21,418] Trial 68 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.0, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:22,534] Trial 69 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 1.37, 'max_iter': 9000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:23,634] Trial 70 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 1.76, 'max_iter': 3000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:24,710] Trial 71 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.36000000000000004, 'max_iter': 1000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:25,810] Trial 72 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 2.79, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:26,827] Trial 73 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.09999999999999999, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:27,917] Trial 74 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.4499999999999997, 'max_iter': 15000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:29,031] Trial 75 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.94, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:30,114] Trial 76 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.27, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:31,239] Trial 77 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.76, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:32,326] Trial 78 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.1999999999999997, 'max_iter': 9000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:33,440] Trial 79 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 0.35000000000000003, 'max_iter': 14000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:34,518] Trial 80 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.34, 'max_iter': 13000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:35,618] Trial 81 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 0.02, 'max_iter': 2000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:36,732] Trial 82 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.5, 'max_iter': 9000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:37,843] Trial 83 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.14, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:38,918] Trial 84 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 2.54, 'max_iter': 4000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:40,026] Trial 85 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.48, 'max_iter': 9000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:41,152] Trial 86 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.61, 'max_iter': 10000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:42,338] Trial 87 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.84, 'max_iter': 4000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:43,471] Trial 88 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.65, 'max_iter': 8000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:44,512] Trial 89 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.2, 'max_iter': 14000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:45,626] Trial 90 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.1600000000000001, 'max_iter': 7000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:46,833] Trial 91 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': False, 'alpha': 0.38, 'max_iter': 9000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:47,936] Trial 92 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 1.83, 'max_iter': 12000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:49,018] Trial 93 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.91, 'max_iter': 1000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:50,126] Trial 94 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 2.75, 'max_iter': 6000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:51,275] Trial 95 finished with value: 4534.641303285158 and parameters: {'fit_intercept': True, 'positive': True, 'alpha': 0.11, 'max_iter': 6000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:52,334] Trial 96 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 0.85, 'max_iter': 5000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:53,425] Trial 97 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': True, 'alpha': 0.25, 'max_iter': 14000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:54,557] Trial 98 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.19, 'max_iter': 14000}. Best is trial 5 with value: 4534.641303285158.
[I 2023-06-12 12:41:55,731] Trial 99 finished with value: 4855.019597835598 and parameters: {'fit_intercept': False, 'positive': False, 'alpha': 1.75, 'max_iter': 13000}. Best is trial 5 with value: 4534.641303285158.
FrozenTrial(number=5, state=TrialState.COMPLETE, values=[4534.641303285158], datetime_start=datetime.datetime(2023, 6, 12, 12, 40, 9, 824016), datetime_complete=datetime.datetime(2023, 6, 12, 12, 40, 10, 913769), params={'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'fit_intercept': CategoricalDistribution(choices=(False, True)), 'positive': CategoricalDistribution(choices=(False, True)), 'alpha': FloatDistribution(high=3.0, log=False, low=0.01, step=0.01), 'max_iter': IntDistribution(high=15000, log=False, low=1000, step=1000)}, trial_id=5, value=None)
Let's record the metrics in our results table.
study_ridge_regression.best_params
{'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000}
pipeline_ridge_regression = Pipeline([
    ('preprocessor', preprocessor_linear),
    ('regressor', Ridge(
        alpha=study_ridge_regression.best_params['alpha'],
        fit_intercept=study_ridge_regression.best_params['fit_intercept'],
        max_iter=study_ridge_regression.best_params['max_iter'],
        # 'positive' is not passed: the 'lsqr' solver does not support it
        solver='lsqr',
    )),
])
i = measure_for_models(
    features_train, target_train, features_valid, target_valid,
    pipeline_ridge_regression,
    'Ridge. Parameters: ' + str(study_ridge_regression.best_params),
    i,
)
results_table
| | model_or_pipeline | fit_wall_time | predict_wall_time | rmse_valid_result |
|---|---|---|---|---|
| 0 | LinearRegression. Parameters: {'fit_intercept': True, 'positive': False} | 0.50 | 0.22 | 3133.76 |
| 1 | Ridge. Parameters: {'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000} | 0.49 | 0.29 | 4529.75 |
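On the validation set Ridge (4529.75) loses to plain LinearRegression (3133.76). Since Ridge converges to the OLS solution as alpha → 0, the gap comes from the chosen regularization strength and solver rather than from the model family; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge_small = Ridge(alpha=1e-8).fit(X, y)

# with vanishing alpha, Ridge coefficients match OLS almost exactly
assert np.allclose(ols.coef_, ridge_small.coef_, atol=1e-4)
```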
Let's tune the hyperparameters for SGDRegressor. This model fits the linear model by stochastic gradient descent with regularization.
def objective(trial):
    max_iter = trial.suggest_int('max_iter', 1000, 15000, step=1000)
    alpha = trial.suggest_float('alpha', 0.01, 3, step=0.01)
    pipeline_sgd_regression = Pipeline([
        ('preprocessor', preprocessor_linear),
        ('regressor', SGDRegressor(
            loss='squared_epsilon_insensitive',
            penalty='l2',
            alpha=alpha,
            l1_ratio=0.15,
            max_iter=max_iter,
            tol=0.001,
            shuffle=True,
            epsilon=0.1,
            random_state=rs,
            learning_rate='invscaling',
            eta0=0.01,
            power_t=0.25,
            early_stopping=False,
            validation_fraction=0.1,
            n_iter_no_change=5,
            warm_start=False,
            average=False,
        )),
    ])
    pipeline_sgd_regression.fit(features_train, target_train)
    # NB: the trial is scored on the training set; validation RMSE
    # is measured later by measure_for_models
    rmse = mean_squared_error(
        target_train, pipeline_sgd_regression.predict(features_train),
        squared=False,
    )
    return rmse
study_sgd_regression = optuna.create_study(
    direction='minimize',
    sampler=optuna.samplers.RandomSampler(seed=rs),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
)
study_sgd_regression.optimize(objective, n_trials=100, timeout=600)
study_sgd_regression.best_trial
[I 2023-06-12 12:41:56,838] A new study created in memory with name: no-name-2cf45dc4-7583-42a9-8384-dd8dab12e3eb
[I 2023-06-12 12:42:19,450] Trial 0 finished with value: 2.7175333176563892e+16 and parameters: {'max_iter': 14000, 'alpha': 0.24000000000000002}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:42:41,241] Trial 1 finished with value: 2.7376612834510644e+16 and parameters: {'max_iter': 15000, 'alpha': 0.75}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:43:04,025] Trial 2 finished with value: 2.796955129945068e+16 and parameters: {'max_iter': 12000, 'alpha': 1.59}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:43:25,590] Trial 3 finished with value: 2.754301829657366e+16 and parameters: {'max_iter': 14000, 'alpha': 2.6599999999999997}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:43:47,538] Trial 4 finished with value: 2.7937241281280172e+16 and parameters: {'max_iter': 2000, 'alpha': 1.56}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:44:09,530] Trial 5 finished with value: 2.719691244201727e+16 and parameters: {'max_iter': 6000, 'alpha': 0.64}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:44:32,211] Trial 6 finished with value: 2.746422794683331e+16 and parameters: {'max_iter': 6000, 'alpha': 0.8200000000000001}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:44:56,772] Trial 7 finished with value: 2.7895322911205188e+16 and parameters: {'max_iter': 12000, 'alpha': 1.44}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:45:20,843] Trial 8 finished with value: 2.7466770161597284e+16 and parameters: {'max_iter': 2000, 'alpha': 0.8300000000000001}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:45:45,093] Trial 9 finished with value: 2.7933298688970976e+16 and parameters: {'max_iter': 12000, 'alpha': 1.55}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:46:07,948] Trial 10 finished with value: 2.7401120044264068e+16 and parameters: {'max_iter': 7000, 'alpha': 0.77}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:46:30,329] Trial 11 finished with value: 2.7668295060454644e+16 and parameters: {'max_iter': 2000, 'alpha': 2.48}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:46:53,307] Trial 12 finished with value: 2.724527006552338e+16 and parameters: {'max_iter': 5000, 'alpha': 0.46}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:47:16,464] Trial 13 finished with value: 2.7551543258857692e+16 and parameters: {'max_iter': 4000, 'alpha': 2.7399999999999998}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:47:38,695] Trial 14 finished with value: 2.7585464445582144e+16 and parameters: {'max_iter': 1000, 'alpha': 1.1300000000000001}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:48:01,373] Trial 15 finished with value: 2.7583708117803464e+16 and parameters: {'max_iter': 5000, 'alpha': 1.07}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:48:23,781] Trial 16 finished with value: 2.7639170287108904e+16 and parameters: {'max_iter': 15000, 'alpha': 1.17}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:48:46,555] Trial 17 finished with value: 2.7274052914083268e+16 and parameters: {'max_iter': 14000, 'alpha': 0.44}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:49:10,026] Trial 18 finished with value: 2.7513077528742936e+16 and parameters: {'max_iter': 8000, 'alpha': 2.78}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:49:33,630] Trial 19 finished with value: 2.763396631317476e+16 and parameters: {'max_iter': 13000, 'alpha': 2.55}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:49:56,729] Trial 20 finished with value: 2.728498764969842e+16 and parameters: {'max_iter': 10000, 'alpha': 0.43}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:50:20,329] Trial 21 finished with value: 2.7175333176563892e+16 and parameters: {'max_iter': 2000, 'alpha': 0.24000000000000002}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:50:45,425] Trial 22 finished with value: 2.7274052914083268e+16 and parameters: {'max_iter': 6000, 'alpha': 0.44}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:51:08,946] Trial 23 finished with value: 2.7583708117803464e+16 and parameters: {'max_iter': 7000, 'alpha': 1.07}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:51:32,158] Trial 24 finished with value: 2.770657414732953e+16 and parameters: {'max_iter': 7000, 'alpha': 2.4099999999999997}. Best is trial 0 with value: 2.7175333176563892e+16.
[I 2023-06-12 12:51:53,729] Trial 25 finished with value: 2.705652773809033e+16 and parameters: {'max_iter': 6000, 'alpha': 0.15000000000000002}. Best is trial 25 with value: 2.705652773809033e+16.
[I 2023-06-12 12:52:15,317] Trial 26 finished with value: 2.7788708061813364e+16 and parameters: {'max_iter': 13000, 'alpha': 1.8800000000000001}. Best is trial 25 with value: 2.705652773809033e+16.
FrozenTrial(number=25, state=TrialState.COMPLETE, values=[2.705652773809033e+16], datetime_start=datetime.datetime(2023, 6, 12, 12, 51, 32, 159114), datetime_complete=datetime.datetime(2023, 6, 12, 12, 51, 53, 729005), params={'max_iter': 6000, 'alpha': 0.15000000000000002}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'max_iter': IntDistribution(high=15000, log=False, low=1000, step=1000), 'alpha': FloatDistribution(high=3.0, log=False, low=0.01, step=0.01)}, trial_id=25, value=None)
The best RMSE found is astronomical (~2.7e16): the SGD model clearly diverged, most likely a learning-rate/feature-scaling issue. Let's record the results in our table anyway.
study_sgd_regression.best_params
{'max_iter': 6000, 'alpha': 0.15000000000000002}
pipeline_sgd_regression = Pipeline([('preprocessor', preprocessor_linear)
, ('regressor', SGDRegressor(loss='squared_epsilon_insensitive'
, penalty='l2'
, alpha=study_sgd_regression.best_params['alpha']
, l1_ratio=0.15
, max_iter=study_sgd_regression.best_params['max_iter']
, tol=0.001
, shuffle=True
, epsilon=0.1
, random_state=rs
, learning_rate='invscaling'
, eta0=0.01
, power_t=0.25
, early_stopping=False
, validation_fraction=0.1
, n_iter_no_change=5
, warm_start=False
, average=False))])
i = measure_for_models(features_train, target_train, features_valid, target_valid
, pipeline_sgd_regression
, 'SGDRegressor. Параметры:'+str(study_sgd_regression.best_params)
, i
)
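`measure_for_models` is defined earlier in the notebook; as a reminder of the kind of measurement it records (fit wall time, predict wall time, validation RMSE), here is a minimal self-contained sketch. The function name, the dummy model and the exact return shape are assumptions for illustration, not the notebook's actual helper.

```python
import time
import numpy as np

def measure_sketch(model, X_train, y_train, X_valid, y_valid):
    """Hypothetical re-creation of the three metrics stored in results_table:
    fit wall time, predict wall time and validation RMSE."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    fit_wall_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    predictions = model.predict(X_valid)
    predict_wall_time = time.perf_counter() - t0
    rmse = float(np.sqrt(np.mean((np.asarray(y_valid) - predictions) ** 2)))
    return fit_wall_time, predict_wall_time, rmse

# Tiny demo model that always predicts the training mean
class MeanModel:
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mean_)

fit_t, pred_t, demo_rmse = measure_sketch(MeanModel(),
                                          np.zeros((10, 2)), np.arange(10.0),
                                          np.zeros((4, 2)), np.full(4, 4.5))
```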
results_table
| | model_or_pipeline | fit_wall_time | predict_wall_time | rmse_valid_result |
|---|---|---|---|---|
| 0 | LinearRegression. Параметры:{'fit_intercept': True, 'positive': False} | 0.50 | 0.22 | 3133.76 |
| 1 | Ridge. Параметры:{'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000} | 0.49 | 0.29 | 4529.75 |
| 2 | SGDRegressor. Параметры:{'max_iter': 6000, 'alpha': 0.15000000000000002} | 21.00 | 0.11 | 27112617069432432.00 |
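The astronomical RMSE for SGDRegressor is a classic symptom of gradient descent diverging on unscaled features. A self-contained NumPy sketch illustrates the effect on synthetic data (not the project dataset; plain batch gradient descent stands in for scikit-learn's SGD):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two features on very different scales, e.g. a year and a horsepower-like value
X = np.column_stack([rng.uniform(1990, 2016, n), rng.uniform(50, 300, n)])
y = 100.0 * (X[:, 0] - 1990) + 10.0 * X[:, 1] + rng.normal(0, 50.0, n)

def gd_fit(X, y, lr=0.1, epochs=50):
    """Plain batch gradient descent on squared error."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        residual = X @ w + b - y
        w -= lr * 2.0 * X.T @ residual / len(y)
        b -= lr * 2.0 * residual.mean()
    return w, b

def rmse(X, y, w, b):
    return float(np.sqrt(np.mean((X @ w + b - y) ** 2)))

with np.errstate(over='ignore', invalid='ignore'):
    # raw features: gradient steps overshoot and the error explodes
    rmse_raw = rmse(X, y, *gd_fit(X, y))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
# standardized features: the same learning rate converges near the noise level
rmse_scaled = rmse(X_scaled, y, *gd_fit(X_scaled, y))
```

Standardizing the numeric features inside `preprocessor_linear` (or lowering `eta0`) would likely bring the notebook's SGDRegressor back to a sane RMSE.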
Let's tune the parameters for the Decision Tree. Here and below, the base algorithm is the decision tree.
def objective(trial):
max_depth = trial.suggest_int('max_depth', 3, 15)
pipeline_tree_regression = Pipeline([('preprocessor', preprocessor_tree)
, ('regressor', DecisionTreeRegressor(random_state=rs
, max_depth=max_depth
))])
pipeline_tree_regression.fit(features_train, target_train)
rmse = mean_squared_error(target_train, pipeline_tree_regression.predict(features_train), squared=False)  # NB: training-set RMSE favors the deepest trees; a held-out score would be safer
return rmse
study = optuna.create_study(direction='minimize'
, sampler=optuna.samplers.RandomSampler(seed=rs)
, pruner=optuna.pruners.MedianPruner(n_warmup_steps=10)
)
study.optimize(objective, n_trials=13)
study.best_trial
[I 2023-06-12 12:52:36,640] A new study created in memory with name: no-name-4e19fa53-c8ea-436b-b717-a1ad451f6f57
[I 2023-06-12 12:52:38,147] Trial 0 finished with value: 1539.5304012850925 and parameters: {'max_depth': 14}. Best is trial 0 with value: 1539.5304012850925.
[I 2023-06-12 12:52:39,086] Trial 1 finished with value: 2641.1247466181685 and parameters: {'max_depth': 4}. Best is trial 0 with value: 1539.5304012850925.
[I 2023-06-12 12:52:40,652] Trial 2 finished with value: 1412.0165250943885 and parameters: {'max_depth': 15}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:41,742] Trial 3 finished with value: 2354.0058604803316 and parameters: {'max_depth': 6}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:43,114] Trial 4 finished with value: 1777.226026475628 and parameters: {'max_depth': 12}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:44,436] Trial 5 finished with value: 2075.177778029109 and parameters: {'max_depth': 9}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:45,947] Trial 6 finished with value: 1539.5304012850925 and parameters: {'max_depth': 14}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:47,457] Trial 7 finished with value: 1539.5304012850925 and parameters: {'max_depth': 14}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:48,416] Trial 8 finished with value: 2641.1247466181685 and parameters: {'max_depth': 4}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:49,633] Trial 9 finished with value: 2075.177778029109 and parameters: {'max_depth': 9}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:50,770] Trial 10 finished with value: 2256.7091812558556 and parameters: {'max_depth': 7}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:51,860] Trial 11 finished with value: 2483.771347937754 and parameters: {'max_depth': 5}. Best is trial 2 with value: 1412.0165250943885.
[I 2023-06-12 12:52:52,990] Trial 12 finished with value: 2256.7091812558556 and parameters: {'max_depth': 7}. Best is trial 2 with value: 1412.0165250943885.
FrozenTrial(number=2, state=TrialState.COMPLETE, values=[1412.0165250943885], datetime_start=datetime.datetime(2023, 6, 12, 12, 52, 39, 87047), datetime_complete=datetime.datetime(2023, 6, 12, 12, 52, 40, 651910), params={'max_depth': 15}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'max_depth': IntDistribution(high=15, log=False, low=3, step=1)}, trial_id=2, value=None)
Let's record the metrics in our table.
study.best_params
{'max_depth': 15}
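Note that the objective above scores the model on the training set, so the search almost inevitably picks the largest `max_depth` (here 15, the top of the range). A self-contained polynomial-fitting sketch on synthetic data (degree plays the role of `max_depth`) shows why scoring on a held-out set is the safer choice:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 80)
y = np.sin(x) + rng.normal(0, 0.3, 80)
x_tr, y_tr = x[:60], y[:60]
x_va, y_va = x[60:], y[60:]

def rmse_pair(degree):
    """Training RMSE and held-out RMSE for a polynomial of the given degree."""
    coef = np.polyfit(x_tr, y_tr, degree)
    tr = float(np.sqrt(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)))
    va = float(np.sqrt(np.mean((np.polyval(coef, x_va) - y_va) ** 2)))
    return tr, va

degrees = range(1, 13)
train_rmse = {d: rmse_pair(d)[0] for d in degrees}
valid_rmse = {d: rmse_pair(d)[1] for d in degrees}
best_by_train = min(degrees, key=train_rmse.get)  # drifts toward the most complex model
best_by_valid = min(degrees, key=valid_rmse.get)  # selects a degree that generalizes
```

Training RMSE can only decrease as the model grows (nested least squares), so selecting by it is a complexity contest, not a quality measure.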
pipeline_tree_regression = Pipeline([('preprocessor', preprocessor_tree)
, ('regressor', DecisionTreeRegressor(random_state=rs
, max_depth=study.best_params['max_depth']
))])
i = measure_for_models(features_train, target_train, features_valid, target_valid
, pipeline_tree_regression
, 'DecisionTreeRegressor. Параметры:'+str(study.best_params)
, i
)
results_table
| | model_or_pipeline | fit_wall_time | predict_wall_time | rmse_valid_result |
|---|---|---|---|---|
| 0 | LinearRegression. Параметры:{'fit_intercept': True, 'positive': False} | 0.50 | 0.22 | 3133.76 |
| 1 | Ridge. Параметры:{'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000} | 0.49 | 0.29 | 4529.75 |
| 2 | SGDRegressor. Параметры:{'max_iter': 6000, 'alpha': 0.15000000000000002} | 21.00 | 0.11 | 27112617069432432.00 |
| 3 | DecisionTreeRegressor. Параметры:{'max_depth': 15} | 1.28 | 0.13 | 2078.77 |
Let's look at the Random Forest model and tune its parameters.
def objective(trial):
max_depth = trial.suggest_int('max_depth', 3, 15)
n_estimators = trial.suggest_int('n_estimators', 10, 400, step=10)
pipeline_forest_regression = Pipeline([('preprocessor', preprocessor_tree)
, ('regressor', RandomForestRegressor(n_estimators=n_estimators  # the sampled value; otherwise the n_estimators search has no effect
, max_depth=max_depth
, n_jobs=-1
, random_state=rs
))])
pipeline_forest_regression.fit(features_train, target_train)
rmse = mean_squared_error(target_train, pipeline_forest_regression.predict(features_train), squared=False)
return rmse
study = optuna.create_study(direction='minimize'
, sampler=optuna.samplers.RandomSampler(seed=rs)
, pruner=optuna.pruners.MedianPruner(n_warmup_steps=10)
)
study.optimize(objective, n_trials=100, timeout=600)
study.best_trial
[I 2023-06-12 12:52:54,489] A new study created in memory with name: no-name-02e8dd62-49c8-4da5-9983-dc7df31636d3
[I 2023-06-12 12:53:50,922] Trial 0 finished with value: 1431.9356042649674 and parameters: {'max_depth': 14, 'n_estimators': 40}. Best is trial 0 with value: 1431.9356042649674.
[I 2023-06-12 12:54:52,776] Trial 1 finished with value: 1318.4707874966264 and parameters: {'max_depth': 15, 'n_estimators': 100}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 12:55:44,812] Trial 2 finished with value: 1665.2573652166075 and parameters: {'max_depth': 12, 'n_estimators': 220}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 12:56:43,285] Trial 3 finished with value: 1431.9356042649674 and parameters: {'max_depth': 14, 'n_estimators': 360}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 12:57:04,482] Trial 4 finished with value: 2596.782392116026 and parameters: {'max_depth': 4, 'n_estimators': 210}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 12:57:37,643] Trial 5 finished with value: 2194.2933493433966 and parameters: {'max_depth': 7, 'n_estimators': 90}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 12:58:09,901] Trial 6 finished with value: 2194.2933493433966 and parameters: {'max_depth': 7, 'n_estimators': 110}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 12:58:57,660] Trial 7 finished with value: 1665.2573652166075 and parameters: {'max_depth': 12, 'n_estimators': 200}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 12:59:16,351] Trial 8 finished with value: 2596.782392116026 and parameters: {'max_depth': 4, 'n_estimators': 120}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 13:00:07,652] Trial 9 finished with value: 1548.1292745494375 and parameters: {'max_depth': 13, 'n_estimators': 210}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 13:00:44,975] Trial 10 finished with value: 2093.2721342677546 and parameters: {'max_depth': 8, 'n_estimators': 110}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 13:01:07,132] Trial 11 finished with value: 2596.782392116026 and parameters: {'max_depth': 4, 'n_estimators': 330}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 13:01:39,381] Trial 12 finished with value: 2194.2933493433966 and parameters: {'max_depth': 7, 'n_estimators': 70}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 13:02:05,285] Trial 13 finished with value: 2437.143330569846 and parameters: {'max_depth': 5, 'n_estimators': 370}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 13:02:25,459] Trial 14 finished with value: 2931.8589194462065 and parameters: {'max_depth': 3, 'n_estimators': 160}. Best is trial 1 with value: 1318.4707874966264.
[I 2023-06-12 13:03:02,453] Trial 15 finished with value: 2194.2933493433966 and parameters: {'max_depth': 7, 'n_estimators': 150}. Best is trial 1 with value: 1318.4707874966264.
FrozenTrial(number=1, state=TrialState.COMPLETE, values=[1318.4707874966264], datetime_start=datetime.datetime(2023, 6, 12, 12, 53, 50, 923693), datetime_complete=datetime.datetime(2023, 6, 12, 12, 54, 52, 776381), params={'max_depth': 15, 'n_estimators': 100}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'max_depth': IntDistribution(high=15, log=False, low=3, step=1), 'n_estimators': IntDistribution(high=400, log=False, low=10, step=10)}, trial_id=1, value=None)
Let's record the results in our table.
study.best_params
{'max_depth': 15, 'n_estimators': 100}
pipeline_forest_regression = Pipeline([('preprocessor', preprocessor_tree)
, ('regressor', RandomForestRegressor(n_estimators=study.best_params['n_estimators']
, max_depth=study.best_params['max_depth']
, n_jobs=-1
, random_state=rs
))])
i = measure_for_models(features_train, target_train, features_valid, target_valid
, pipeline_forest_regression
, 'RandomForestRegressor. Параметры:'+str(study.best_params)
, i
)
results_table
| | model_or_pipeline | fit_wall_time | predict_wall_time | rmse_valid_result |
|---|---|---|---|---|
| 0 | LinearRegression. Параметры:{'fit_intercept': True, 'positive': False} | 0.50 | 0.22 | 3133.76 |
| 1 | Ridge. Параметры:{'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000} | 0.49 | 0.29 | 4529.75 |
| 2 | SGDRegressor. Параметры:{'max_iter': 6000, 'alpha': 0.15000000000000002} | 21.00 | 0.11 | 27112617069432432.00 |
| 3 | DecisionTreeRegressor. Параметры:{'max_depth': 15} | 1.28 | 0.13 | 2078.77 |
| 4 | RandomForestRegressor. Параметры:{'max_depth': 15, 'n_estimators': 100} | 60.40 | 1.25 | 1769.28 |
Let's tune the parameters for LightGBM, a model based on the gradient-boosting algorithm.
def objective(trial):
n_estimators = trial.suggest_int('n_estimators', 10, 400, step=10)
max_depth = trial.suggest_int('max_depth', 3, 15)
min_split_gain = trial.suggest_float('min_split_gain', 0, 15)
num_leaves = trial.suggest_int('num_leaves', 20, 3000, step=20)
pipeline_lgbm_regressor = Pipeline([
('preprocessor', preprocessor_tree),
# ('preprocessor', num_preprocessor_tree),
('regressor', lgb.LGBMRegressor(boosting_type='gbdt'
, num_leaves=num_leaves
, max_depth=max_depth
, learning_rate=0.1
, n_estimators=n_estimators
# , subsample_for_bin=200000
# , objective=None
# , class_weight=None
, min_split_gain=min_split_gain
# , min_child_weight=0.001
# , min_child_samples=20
# , subsample=1.0
# , subsample_freq=0
# , colsample_bytree=1.0
# , reg_alpha=0.0
# , reg_lambda=0.0
, random_state=rs
, n_jobs=-1
# , importance_type='split'
)
)
]
)
pipeline_lgbm_regressor.fit(features_train
, target_train
, regressor__categorical_feature=category_features_index
)
# prediction_valid = pipeline_lgbm_regressor.predict(features_valid)
# rmse = mean_squared_error(target_valid, prediction_valid, squared=False)
rmse = mean_squared_error(target_train, pipeline_lgbm_regressor.predict(features_train), squared=False)
return rmse
study = optuna.create_study(direction='minimize'
, sampler=optuna.samplers.RandomSampler(seed=rs)
, pruner=optuna.pruners.MedianPruner(n_warmup_steps=10)
)
study.optimize(objective, n_trials=100, timeout=600)
study.best_trial
[I 2023-06-12 13:04:04,180] A new study created in memory with name: no-name-a93ee833-3925-41c0-a0d1-6f7f078de03d
[I 2023-06-12 13:11:45,560] Trial 0 finished with value: 1628.252152672761 and parameters: {'n_estimators': 360, 'max_depth': 4, 'min_split_gain': 14.694692359684636, 'num_leaves': 760}. Best is trial 0 with value: 1628.252152672761.
[I 2023-06-12 13:39:58,848] Trial 1 finished with value: 1293.936171071441 and parameters: {'n_estimators': 310, 'max_depth': 9, 'min_split_gain': 13.613306298597603, 'num_leaves': 2660}. Best is trial 1 with value: 1293.936171071441.
FrozenTrial(number=1, state=TrialState.COMPLETE, values=[1293.936171071441], datetime_start=datetime.datetime(2023, 6, 12, 13, 11, 45, 561590), datetime_complete=datetime.datetime(2023, 6, 12, 13, 39, 58, 847844), params={'n_estimators': 310, 'max_depth': 9, 'min_split_gain': 13.613306298597603, 'num_leaves': 2660}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'n_estimators': IntDistribution(high=400, log=False, low=10, step=10), 'max_depth': IntDistribution(high=15, log=False, low=3, step=1), 'min_split_gain': FloatDistribution(high=15.0, log=False, low=0.0, step=None), 'num_leaves': IntDistribution(high=3000, log=False, low=20, step=20)}, trial_id=1, value=None)
study.best_params
{'n_estimators': 310,
'max_depth': 9,
'min_split_gain': 13.613306298597603,
'num_leaves': 2660}
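A side note on the tuned values: a depth-limited binary tree cannot have more than 2^max_depth leaves, so with `max_depth=9` the tuned `num_leaves=2660` is not binding. Capping the `num_leaves` search range at `2 ** max_depth` would shrink the search space at no cost. A one-line check:

```python
max_depth = 9      # tuned value from the study above
num_leaves = 2660  # tuned value from the study above
# a binary tree of depth 9 has at most 2**9 = 512 leaves
effective_leaves = min(num_leaves, 2 ** max_depth)
```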
pipeline_lgbm_regressor = Pipeline([
('preprocessor', preprocessor_tree), ('regressor', lgb.LGBMRegressor(boosting_type='gbdt'
, num_leaves=study.best_params['num_leaves']
, max_depth=study.best_params['max_depth']
, learning_rate=0.1
, n_estimators=study.best_params['n_estimators']
, min_split_gain=study.best_params['min_split_gain']
, random_state=rs
, n_jobs=-1
))])
i = measure_for_lgbm(features_train, target_train, features_valid, target_valid
, pipeline_lgbm_regressor
, 'LGBMRegressor. Параметры:'+str(study.best_params)
, i
, category_features_index
)
results_table
| | model_or_pipeline | fit_wall_time | predict_wall_time | rmse_valid_result |
|---|---|---|---|---|
| 0 | LinearRegression. Параметры:{'fit_intercept': True, 'positive': False} | 0.50 | 0.22 | 3133.76 |
| 1 | Ridge. Параметры:{'fit_intercept': True, 'positive': False, 'alpha': 2.69, 'max_iter': 3000} | 0.49 | 0.29 | 4529.75 |
| 2 | SGDRegressor. Параметры:{'max_iter': 6000, 'alpha': 0.15000000000000002} | 21.00 | 0.11 | 27112617069432432.00 |
| 3 | DecisionTreeRegressor. Параметры:{'max_depth': 15} | 1.28 | 0.13 | 2078.77 |
| 4 | RandomForestRegressor. Параметры:{'max_depth': 15, 'n_estimators': 100} | 60.40 | 1.25 | 1769.28 |
| 5 | LGBMRegressor. Параметры:{'n_estimators': 310, 'max_depth': 9, 'min_split_gain': 13.613306298597603, 'num_leaves': 2660} | 971.28 | 10.91 | 1691.10 |
As we can see, on our data the tree-based models perform better and fit within the RMSE < 2500 constraint, while the linear models are faster both at training and at prediction.
If training and prediction time were the priority criteria, then, RMSE permitting, I would declare the Decision Tree the winner.
But since we want the RMSE to be more stable while training and prediction time still matter, the winner is the Random Forest model; the LightGBM gradient-boosting model has the best score, but it simply takes too long.
Important: our models include the data preprocessing, and the times in the table are wall-clock times. These measurements do not contradict the theoretical proportions.
pipeline_
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value=-1000,
strategy='constant'))]),
['registration_year', 'power',
'kilometer']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value='unknown',
strategy='constant')),
('encoder',
OrdinalEncoder(dtype=<class 'numpy.uint32'>,
handle_unknown='ignore'))]),
['gearbox', 'repaired',
'model', 'vehicle_type',
'fuel_type', 'brand'])])),
('regressor',
LGBMRegressor(max_depth=9, min_split_gain=13.613306298597603,
n_estimators=310, num_leaves=2660,
random_state=321))])
Now for the most important part: evaluation on the test set.
mean_squared_error(target_test
, pipeline_forest_regression.predict(features_test)
, squared=False)
1765.36536624984
%%timeit
mean_squared_error(target_test
, pipeline_forest_regression.predict(features_test)
, squared=False)
1.15 s ± 47.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusions:
The project task is complete. We found a model that meets the customer's criteria:
Moreover, we built the data preprocessing into the model itself.
pipeline_forest_regression
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value=-1000,
strategy='constant'))]),
['registration_year', 'power',
'kilometer']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value='unknown',
strategy='constant')),
('encoder',
OrdinalEncoder(dtype=<class 'numpy.uint32'>,
handle_unknown='ignore'))]),
['gearbox', 'repaired',
'model', 'vehicle_type',
'fuel_type', 'brand'])])),
('regressor',
RandomForestRegressor(max_depth=15, n_jobs=-1,
random_state=321))])
Our model was trained with the categorical features taken into account.
category_features
['gearbox', 'repaired', 'model', 'vehicle_type', 'fuel_type', 'brand']
or, more precisely, with their indices
category_features_index
[3, 9, 5, 1, 7, 8]
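Rather than hard-coding the index list, it can be derived from the feature frame's column order, e.g. `[list(features_train.columns).index(c) for c in category_features]`, so it stays correct if columns are reordered. A sketch with a hypothetical column order consistent with the indices above (the notebook's real order lives in `features_train.columns`):

```python
# Hypothetical column order chosen only to reproduce the indices shown above;
# the actual order comes from features_train.columns.
columns = ['registration_year', 'vehicle_type', 'power', 'gearbox', 'kilometer',
           'model', 'registration_month', 'fuel_type', 'brand', 'repaired']
category_features = ['gearbox', 'repaired', 'model', 'vehicle_type', 'fuel_type', 'brand']
category_features_index = [columns.index(c) for c in category_features]
```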
Before training the model we did the following preparatory work:
What didn't work out:
Unfortunately, it turned out that the models in the libraries available to us are hard to map rigorously onto the theoretical concepts.
I did not manage to fill the table with models in a strict fashion:
For example, judging by the training time one might suspect that LinearRegression uses gradient descent, possibly even stochastic; in fact scikit-learn's LinearRegression solves the least-squares problem directly (via `scipy.linalg.lstsq`), which makes mapping wall time onto iterative-method theory even harder.
Consequently, within this project and its limited time we could not fully reconcile the theoretical asymptotic-complexity estimates with the wall-clock times.
That is why we decided to reduce the number of models during the project. It is encouraging that the actual results did not contradict the theoretical expectations.